We’ll use the Credit dataframe from the ISLR package to demonstrate multiple regression with:
A numerical outcome variable y, in this case credit card balance.
Two explanatory variables:
A first numerical explanatory variable \(x_1\). In this case, their credit limit.
A second numerical explanatory variable \(x_2\). In this case, their income (in thousands of dollars).
Note: This dataset is not based on actual individuals, it is a simulated dataset used for educational purposes.
Let’s load the Credit data and:
Use the View command to look at raw data.
Now select() only Balance, Limit, Income, Rating and Age variables. (We will be using Rating and Age in a forthcoming exercise)
# Write your code here
#
library(ISLR)
Credit2 <- Credit %>%
select(Balance, Limit, Income, Rating, Age)
str(Credit2)
'data.frame': 400 obs. of 5 variables:
$ Balance: int 333 903 580 964 331 1151 203 872 279 1350 ...
$ Limit : int 3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
$ Income : num 14.9 106 104.6 148.9 55.9 ...
$ Rating : int 283 483 514 681 357 569 259 512 266 491 ...
$ Age : int 34 82 71 36 68 77 37 87 66 41 ...
Let’s look at some summary statistics for the variables that we need for the problem at hand.
# Write your code here
#
Credit2 %>%
select(Balance, Limit, Income) %>%
skim()
| Name | Piped data |
| Number of rows | 400 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Balance | 0 | 1 | 520.01 | 459.76 | 0.00 | 68.75 | 459.50 | 863.00 | 1999.00 | ▇▅▃▂▁ |
| Limit | 0 | 1 | 4735.60 | 2308.20 | 855.00 | 3088.00 | 4622.50 | 5872.75 | 13913.00 | ▆▇▃▁▁ |
| Income | 0 | 1 | 45.22 | 35.24 | 10.35 | 21.01 | 33.12 | 57.47 | 186.63 | ▇▂▁▁▁ |
Let’s also look at _______________ as visual aids.
library(cowplot)
p1 <- ggplot(data = Credit, aes(x = Balance)) + geom_histogram(binwidth = 150, color = "white", fill = "orchid4")
p2 <- ggplot(data = Credit, aes(x = Limit)) + geom_histogram(binwidth = 800, color = "white", fill = "orange")
p3 <- ggplot(data = Credit, aes(x = Income)) + geom_histogram(binwidth = 10, color = "white", fill = "steelblue")
plot_grid(p1, p2, p3)
We observe for example:
Since our outcome variable Balance and the explanatory variables Limit and Income are numerical, we can and have to compute the correlation coefficient between pairs of these variables before we proceed to build a model.
#Write the code to find the correlations:
Credit %>%
select(Balance, Limit, Income) %>%
cor()
Balance Limit Income
Balance 1.0000000 0.8616973 0.4636565
Limit 0.8616973 1.0000000 0.7920883
Income 0.4636565 0.7920883 1.0000000
Balance with Limit is 0.862. This indicates a strong positive linear relationship, which makes sense as only individuals with large credit limits can accrue large credit card balances.
Balance with Income is 0.464. This is suggestive of another positive linear relationship, although not as strong as the relationship between Balance and Limit.
As an added bonus, we can read off the correlation coefficient of the two explanatory variables, Limit and Income of 0.792. In this case, we say there is a high degree of collinearity between these two explanatory variables.
Note: Collinearity (or multicollinearity) is a phenomenon in which one explanatory variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. So in this case, if we knew someone’s credit card Limit and since Limit and Income are highly correlated, we could make a fairly accurate guess as to that person’s Income. Or put loosely, these two variables provided redundant information. For now let’s ignore any issues related to collinearity and press on.
Let’s visualize the relationship of the outcome variable with each of the two explanatory variables in two separate plots:
sc1 <- ggplot(Credit, aes(x = Limit, y = Balance)) +
geom_point() +
labs(x = "Credit limit (in $)", y = "Credit card balance (in $)",
title = "balance and credit limit") +
geom_smooth(method = "lm", se = FALSE)
sc2 <- ggplot(Credit, aes(x = Income, y = Balance)) +
geom_point() +
labs(x = "Income (in $1000)", y = "Credit card balance (in $)",
title = "balance and income") +
geom_smooth(method = "lm", se = FALSE)
plot_grid(sc1, sc2)
To get a sense of the joint relationship of all three variables simultaneously through a visualization, let’s display the data in a 3-dimensional (3D) scatterplot, where
The numerical outcome variable \(y\) Balance is on the z-axis (vertical axis)
The two numerical explanatory variables form the “floor” axes. In this case
# draw 3D scatterplot
p <- plot_ly(data = Credit, z = ~Balance, x = ~Income, y = ~Limit, opacity = 0.6, color = Credit$Balance) %>%
add_markers()
p
Exercise
Conduct a new exploratory data analysis with the same outcome variable \(y\) being Balance but with Rating and Age as the new explanatory variables \(x_1\) and \(x_2\). Remember, this involves three things:
What can you say about the relationship between a credit card holder’s balance and their credit rating and age?
We now use a + to consider multiple explanatory variables. Here is the syntax:
model_name <- lm(y ~ x1 + x2 + ... +xn, data = data_frame_name)
Balance_model <- lm(Balance ~ Limit + Income, data = Credit)
Balance_model
Call:
lm(formula = Balance ~ Limit + Income, data = Credit)
Coefficients:
(Intercept) Limit Income
-385.1793 0.2643 -7.6633
# Or use one of the followings to see more info...
get_regression_table(Balance_model)
# A tibble: 3 x 7
term estimate std_error statistic p_value lower_ci upper_ci
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 intercept -385. 19.5 -19.8 0 -423. -347.
2 Limit 0.264 0.006 45.0 0 0.253 0.276
3 Income -7.66 0.385 -19.9 0 -8.42 -6.91
#summary(Balance_model)
Write your model here:
*
How do we interpret these three values that define the regression plane?
Intercept: -$385.18 (rounded to two decimal points to represent cents). The intercept in our case represents the credit card balance for an individual who has both a credit Limit of $0 and Income of $0.
Limit: $0.26. (Now that we have multiple variables to consider, we have to add a caveat to our interpretation:) Holding all the other variables fixed (Income, in this case), for every increase of one unit in credit Limit (dollars), there is an associated increase of on average $0.26 in credit card balance.
Income: -$7.66. Similarly, Holding all the other variables fixed (Limit, in this case),, for every increase of one unit in Income (in other words, $10,000 in income), there is an associated decrease of on average $7.66 in credit card balance.
WAIT! Did something go wrong? Interpretation of the
Incomecoefficient is alarming.
Recall in individual scatterplots that when considered, both Limit and Income had positive relationships with the outcome variable Balance. As card holders’ credit limits increased their credit card balances tended to increase as well, and a similar relationship held for incomes and balances. In the above multiple regression, however, the slope for Income is now -7.66, suggesting a negative relationship between income and credit card balance. What explains these contradictory results?
This is known as **Simpson’s Paradox**, a phenomenon in which a trend appears in several different groups of data but disappears or reverses when these groups are combined.
As we did previously in ch 5, let’s look at the fitted values and residuals.
# get the following table
regression_points <- get_regression_points(Balance_model)
regression_points
# A tibble: 400 x 6
ID Balance Limit Income Balance_hat residual
<int> <int> <int> <dbl> <dbl> <dbl>
1 1 333 3606 14.9 454. -121.
2 2 903 6645 106. 559. 344.
3 3 580 7075 105. 683. -103.
4 4 964 9504 149. 986. -21.7
5 5 331 4897 55.9 481. -150.
6 6 1151 8047 80.2 1127. 23.6
7 7 203 3388 21.0 349. -146.
8 8 872 7114 71.4 948. -76.0
9 9 279 3300 15.1 371. -92.2
10 10 1350 6819 71.1 873. 477.
# … with 390 more rows
ggplot(Balance_model, aes(x = .fitted, y = .resid)) + geom_point()
Kevin has a credit limit of $5080 and his income is $150,000. Use the Balance_model to predict Kevin’s balance.
newx <- data.frame(Limit = _____, Income = ____)
predicted_balance <- predict(Balance_model, newx)
predicted_balance
Exercise
Fit a new linear regression using where Rating and Age are the new numerical explanatory variables \(x_1\) and \(x_2\).
Let’s revisit the instructor evaluation data introduced in Ch 6.
Consider a modeling scenario with:
A numerical outcome variable \(y\). As before, instructor evaluation score.
Two explanatory variables:
A numerical explanatory variable \(x_1\): in this case, their age.
A categorical explanatory variable \(x_2\): in this case, their binary gender.
Let’s reload the evals data and select() only the needed subset of variables. Note that these are different than the variables chosen in Chapter 6. Let’s given this the name evals_ch7.
evals_ch7 <- evals %>%
select(score, age, gender)
View() and the glimpse() functions.str(evals_ch7)
Classes 'tbl_df', 'tbl' and 'data.frame': 463 obs. of 3 variables:
$ score : num 4.7 4.1 3.9 4.8 4.6 4.3 2.8 4.1 3.4 4.5 ...
$ age : int 36 36 36 36 59 59 59 51 51 40 ...
$ gender: Factor w/ 2 levels "female","male": 1 1 1 1 2 2 2 2 2 1 ...
skim() function from the skimr package:evals_ch7 %>%
skim()
| Name | Piped data |
| Number of rows | 463 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 0 | 1 | FALSE | 2 | mal: 268, fem: 195 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| score | 0 | 1 | 4.17 | 0.54 | 2.3 | 3.8 | 4.3 | 4.6 | 5 | ▁▁▅▇▇ |
| age | 0 | 1 | 48.37 | 9.80 | 29.0 | 42.0 | 48.0 | 57.0 | 73 | ▅▆▇▆▁ |
Describe the output: Homework
score and age. Recall that correlation coefficients only exist between numerical variables.evals_ch7 %>%
get_correlation(formula = score ~ age)
# A tibble: 1 x 1
cor
<dbl>
1 -0.107
We observe that they are _______ and __________ correlated.
Now, let’s try to visualize the correlation.
Create a scatterplot of score over age. Use the binary gender variable to color the point with two colors. Add a regression line (or two?) in to your scatterplot.
ggplot(evals_ch7, aes(x = age, y = score, color = gender)) +
geom_point() +
labs(x = "Age", y = "Teaching Score", color = "Gender", title = "Graph using 'method = lm'") +
geom_smooth(method = "lm", se = FALSE) + theme_bw()
Say a couple of interesting things about the graph you’ve created.
Much like we started to consider multiple explanatory variables using the + sign in the previous section, let’s fit a regression model and get the regression table.
score_model_2 <- lm(score ~ age + gender, data = evals_ch7)
summary(score_model_2)
Call:
lm(formula = score ~ age + gender, data = evals_ch7)
Residuals:
Min 1Q Median 3Q Max
-1.82833 -0.33494 0.09391 0.42882 0.91506
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.484116 0.125284 35.792 < 0.0000000000000002 ***
age -0.008678 0.002646 -3.280 0.001117 **
gendermale 0.190571 0.052469 3.632 0.000313 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5343 on 460 degrees of freedom
Multiple R-squared: 0.03901, Adjusted R-squared: 0.03484
F-statistic: 9.338 on 2 and 460 DF, p-value: 0.0001059
get_regression_table(score_model_2)
# A tibble: 3 x 7
term estimate std_error statistic p_value lower_ci upper_ci
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 intercept 4.48 0.125 35.8 0 4.24 4.73
2 age -0.009 0.003 -3.28 0.001 -0.014 -0.003
3 gendermale 0.191 0.052 3.63 0 0.087 0.294
The modeling equation for this scenario is: (Write the model you obtained)
Write the model for male:
Write the model for female:
Let’s create the scatterplot of score over age AGAIN. Use the binary gender variable to color the point with two colors. Add a regression lines in to your scatterplot BUT, this time use the model(s) we created.
Interpretaions of the coefficients:
\(b_{male} = 0.1906\) is the average difference in teaching score that men get relative to the baseline of women.
Accordingly, the intercepts (which in this case make no sense since no instructor can have an age of 0) are :
for women: \(b_0= 4.484\)
for men: \(b_0 +b_{male} = 4.484 + 0.191 = 4.675\)
Both men and women have the same slope. In other words, in this model the associated effect of age is the same for men and women. So for every increase of one year in age, there is on average an associated decrease of \(b_{age} = -0.009\) in teaching score.
Warning! ⚠
But wait!!!, why is the lines in the original scatterplot different than the lines in this one? What is going on?
age and gender.Once you nitice an interaction effect on the original scatterplot, you will have to create a model with interaction…(next section)
We say a model has an interaction effect if the associated effect of one variable depends on the value of another variable. These types of models usually prove to be tricky to view on first glance because of their complexity. In this case, the effect of age will depend on the value of gender. (as was suggested by the different slopes for men and women in our visual exploratory data analysis)
Let’s fit a regression with an interaction term. We add an interaction term using the * sign. Let’s fit this regression and save it in score_model_interaction, then we get the regression table using the get_regression_table() function as before.
score_model_interaction <- lm(score ~ age + gender + age * gender, data = evals_ch7)
get_regression_table(score_model_interaction)
# A tibble: 4 x 7
term estimate std_error statistic p_value lower_ci upper_ci
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 intercept 4.88 0.205 23.8 0 4.48 5.29
2 age -0.018 0.004 -3.92 0 -0.026 -0.009
3 gendermale -0.446 0.265 -1.68 0.094 -0.968 0.076
4 age:gendermale 0.014 0.006 2.45 0.015 0.003 0.024
The modeling equation for this scenario is (Writing the equation):
\[\hat{y} = b_0 + b_1 \cdot x_1 + b_2 \cdot x_2 + b_3 \cdot x_1 \cdot x_2\]
\[\hat{score} = 4.883 - 0.018 \cdot age - 0.446 \cdot 1_{Male}(x) + 0.014 \cdot age \cdot 1_{Male}(x) \]
Write the model for male:
\[\hat{score} = 4.883 - 0.018 \cdot age - 0.446 \cdot 1 + 0.014 \cdot age \cdot 1 \] \[\hat{score} = 4.437 - 0.004 \cdot age \]
Write the model for female:
\[\hat{score} = 4.883 - 0.018 \cdot age - 0.446 \cdot 0 + 0.014 \cdot age \cdot 0 \] \[\hat{score} = 4.883 - 0.018 \cdot age \]
We see that while male instructors have a lower intercept, as they age, they have a less steep associated average decrease in teaching scores: 0.004 teaching score units per year as opposed to -0.018 for women. This is consistent with the different slopes and intercepts of the red and blue regression lines fit in the original scatter plot.